Applied Metaphors: Learning TRIZ, Complexity, Data/Stats/ML using Metaphors


    Groups

    Plotting Distributions over Categories

    Qual Variables
    Quant Variables
    Box Plots
    Violin Plots
    Author

    Arvind V.

    Published

    June 24, 2024

    Modified

    July 6, 2025

    Abstract
    Quant and Qual Variable Graphs and their Siblings

    Slides and Tutorials

    R (Static Viz)   Radiant Tutorial  Datasets

    “In keeping silent about evil, in burying it so deep within us that no sign of it appears on the surface, we are implanting it, and it will rise up a thousand fold in the future.”

    — Aleksandr Solzhenitsyn

    Setting up R Packages

    library(tidyverse)
    library(mosaic)
    library(skimr) # skim() is used below to examine the data
    library(ggformula)
    library(visStatistics) # All in one plot + stats test package
    library(palmerpenguins) # Our new favourite dataset
    ##
    library(tidyplots) # Easily Produced Publication-Ready Plots
    library(tinyplot) # Plots with Base R
    library(tinytable) # Elegant Tables for our data
    
    ## ggplot theme
    library(hrbrthemes)
    hrbrthemes::import_roboto_condensed() # Import Roboto Condensed font for use in charts
    hrbrthemes::update_geom_font_defaults() # Update matching font defaults for text geoms
    ggplot2::theme_set(new = theme_classic(base_family = "Roboto Condensed")) # Set consistent graph theme

    What graphs will we see today?

    | Variable #1 | Variable #2 | Chart Names | Chart Shape |
    |---|---|---|---|
    | Quant | Qual | Box Plot | |

    What kind of Data Variables will we choose?

    | No | Pronoun | Answer | Variable/Scale | Example | What Operations? |
    |---|---|---|---|---|---|
    | 1 | How Many / Much / Heavy? Few? Seldom? Often? When? | Quantities, with Scale and a Zero Value. Differences and Ratios/Products are meaningful. | Quantitative/Ratio | Length, Height, Temperature in Kelvin, Activity, Dose Amount, Reaction Rate, Flow Rate, Concentration, Pulse, Survival Rate | Correlation |
    | 4 | What, Who, Where, Whom, Which | Name, Place, Animal, Thing | Qualitative/Nominal | Name | Count no. of cases, Mode |

    Inspiration

    Figure 1: Box Plot Inspiration

    Alice said, “I say what I mean and I mean what I say!” Are the rest of us so sure? What do we mean when we use any of the phrases above? How definite are we? There is a range of “sureness” and “unsureness”…and this is where we can use box plots like Figure 1 to show that range of opinion.

    Maybe it is time for a box plot on, uh, shades of meaning for Jane Austen Gen-Z phrases! Bah.

    How do these Chart(s) Work?

    Box Plots are an extremely useful data visualization that gives us an idea of the distribution of a Quant variable, for each level of another Qual variable.

    Figure 2: Box Plot Definitions

    The internal process of this plot is as follows:

    (Hat tip to student Tanya Michelle Justin for a good question on outlier calculation)

    • Make groups of the Quant variable for each level of the Qual
    • In each group, rank the Quant variable values in increasing order
    • Calculate:
      • The values for median = Q2, Q1, and Q3 based on rank!!
      • Values for min, max, and then IQR = Q3 - Q1
      • Calculate outlier limits:
        • \([Q1 - 1.5*IQR, Q3 + 1.5*IQR]\)
      • Whiskers: Extend to the most extreme values within \([Q1 - 1.5*IQR, Q3 + 1.5*IQR]\)
      • Outliers: All values outside of \([Q1 - 1.5*IQR, Q3 + 1.5*IQR]\)
    • Plot these as a vertical or horizontal box structure, as shown.

    As a result of this, while the box-part of the boxplot always shows 2 full quartiles, the whiskers may not stretch through their quartiles, since some values may be outliers on either side.
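
    The steps above can be sketched in a few lines of base R. This is a minimal illustration on a made-up vector `x` (not from the wages data), and it assumes the default `quantile()` method for the quartiles; plotting functions may compute the box edges slightly differently for some sample sizes.

```r
# A small made-up sample with one suspiciously large value
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 30)

# Step 1: rank the values in increasing order
x_sorted <- sort(x)

# Step 2: quartiles based on rank (default quantile() method, an assumption)
q <- quantile(x_sorted, probs = c(0.25, 0.5, 0.75), names = FALSE)
q1 <- q[1]
med <- q[2]
q3 <- q[3]

# Step 3: IQR and the outlier limits (often called "fences")
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Step 4: whiskers reach the most extreme values that lie inside the fences
inside <- x_sorted[x_sorted >= lower_fence & x_sorted <= upper_fence]
whiskers <- range(inside)

# Step 5: everything outside the fences is drawn as an outlier point
outliers <- x_sorted[x_sorted < lower_fence | x_sorted > upper_fence]

c(Q1 = q1, median = med, Q3 = q3, IQR = iqr)
whiskers
outliers
```

    Here the value 30 falls outside the upper limit, so the upper whisker stops at 9 rather than stretching through the whole top quartile.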

    Note: Ranks and Values

    The Quant variable is ordered based on the values from min to max. So you could imagine that each value has a rank or sequence number. The min value has \(rank = 1\) and the max value has \(rank = length(var)\).
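
    As a quick sketch of this idea, base R's `rank()` returns the sequence number of each value. The vector `x` here is a made-up example.

```r
# Made-up values, in no particular order
x <- c(12, 5, 20, 7)

# rank() gives each value its position in the sorted sequence
ranks <- rank(x)
ranks

# The smallest value has rank 1, the largest has rank length(x)
ranks[which.min(x)]
ranks[which.max(x)] == length(x)
```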

    Note: Histograms and Box Plots

    Note how the histogram dwells upon the mean and standard deviation, whereas the boxplot focuses on the median and quartiles. The former uses the values of the Quant variable, whereas the latter uses their sequence numbers or ranks.
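
    One consequence of using ranks rather than values is that the median barely notices extreme observations, while the mean does. A small sketch with made-up numbers:

```r
# Two made-up samples that differ only in their largest value
x_mild <- c(2, 4, 4, 5, 6, 7, 8, 9, 30)
x_wild <- c(2, 4, 4, 5, 6, 7, 8, 9, 300)

# The mean uses the values themselves, so it moves a lot
mean(x_mild)
mean(x_wild)

# The median uses only the rank order, so it does not move at all
median(x_mild)
median(x_wild)
```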

    Box plots are often used, for example, in HR operations to understand salary distributions across grades of employees. Marks of students in competitive exams are also declared using quartiles.

    Box plots can show skew in distributions, with a large number of outliers on one side, as in Figure 3.

    Figure 3: Skewed Distributions and Boxplots

    In other cases, there may be no outliers, but the “bottom” and the “lid” of the box may not be the same size!

    (a) Box Plot and Skewness
    (b) Density and Skewness
    Figure 4: Box Plot Discussions

    In Figure 4, we see the difference between boxplots that show symmetric and skewed distributions. The “lid” and the “bottom” of the box are not of similar width in distributions with significant skewness.

    Compare these with the corresponding densities in Figure 4 (b).
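
    To see the "lid" and "bottom" widths numerically, here is a small sketch using simulated data. The two samples and the helper `box_halves()` are assumptions made for illustration only; they are not the data behind Figure 4.

```r
set.seed(42)

# Simulated samples: one roughly symmetric, one right-skewed
symmetric <- rnorm(1000, mean = 50, sd = 10)
skewed <- rlnorm(1000, meanlog = 0, sdlog = 1)

# Hypothetical helper: widths of the two halves of the box
box_halves <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
  c(bottom = q[2] - q[1], lid = q[3] - q[2])
}

halves_symmetric <- box_halves(symmetric)
halves_skewed <- box_halves(skewed)

halves_symmetric # the two halves are of similar width
halves_skewed    # the lid is clearly wider than the bottom
```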

    Case Study-1: gss_wages dataset

    We will first look at Wage data from the General Social Survey (1974-2018) conducted in the USA, which is used to illustrate wage discrepancies by gender (while also considering respondent occupation, age, and education). This is available on Vincent Arel-Bundock’s superb repository of datasets. Let us read it into R directly from the website.

    • R
    wages <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/stevedata/gss_wages.csv")


    Examine the Data

    As per our Workflow, we will look at the data using all three methods we have seen.

    • dplyr
    • skimr
    • mosaic
    • web-r
    glimpse(wages)
    Rows: 61,697
    Columns: 12
    $ rownames   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
    $ year       <dbl> 1974, 1974, 1974, 1974, 1974, 1974, 1974, 1974, 1974, 1974,…
    $ realrinc   <dbl> 4935, 43178, NA, NA, 18505, 22206, 55515, NA, NA, 4935, NA,…
    $ age        <dbl> 21, 41, 83, 69, 58, 30, 48, 67, 51, 54, 89, 71, 27, 30, 22,…
    $ occ10      <dbl> 5620, 2040, NA, NA, 5820, 910, 230, 6355, 4720, 3940, 4810,…
    $ occrecode  <chr> "Office and Administrative Support", "Professional", NA, NA…
    $ prestg10   <dbl> 25, 66, NA, NA, 37, 45, 59, 49, 28, 38, 47, 45, 50, 29, 33,…
    $ childs     <dbl> 0, 3, 2, 2, 0, 0, 2, 1, 2, 2, 3, 1, 4, 3, 0, 1, 2, 3, 4, 8,…
    $ wrkstat    <chr> "School", "Full-Time", "Housekeeper", "Housekeeper", "Full-…
    $ gender     <chr> "Male", "Male", "Female", "Female", "Female", "Male", "Male…
    $ educcat    <chr> "High School", "Bachelor", "Less Than High School", "Less T…
    $ maritalcat <chr> "Married", "Married", "Widowed", "Widowed", "Never Married"…
    skim(wages)
    Data summary
    Name wages
    Number of rows 61697
    Number of columns 12
    _______________________
    Column type frequency:
    character 5
    numeric 7
    ________________________
    Group variables None

    Variable type: character

    skim_variable n_missing complete_rate min max empty n_unique whitespace
    occrecode 3561 0.94 5 37 0 11 0
    wrkstat 21 1.00 5 23 0 8 0
    gender 0 1.00 4 6 0 2 0
    educcat 135 1.00 8 21 0 5 0
    maritalcat 27 1.00 7 13 0 5 0

    Variable type: numeric

    skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
    rownames 0 1.00 30849.00 17810.53 1 15425 30849 46273 61697.0 ▇▇▇▇▇
    year 0 1.00 1996.07 12.79 1974 1985 1996 2006 2018.0 ▆▇▇▇▇
    realrinc 23810 0.61 22326.36 28581.79 227 8156 16563 27171 480144.5 ▇▁▁▁▁
    age 219 1.00 46.18 17.56 18 32 44 59 89.0 ▇▇▆▅▂
    occ10 3561 0.94 4695.77 2627.72 10 2710 4720 6230 9997.0 ▃▅▇▂▃
    prestg10 4186 0.93 43.06 12.99 16 33 42 50 80.0 ▃▇▇▃▁
    childs 189 1.00 1.92 1.76 0 0 2 3 8.0 ▇▇▂▁▁
    inspect(wages)
    
    categorical variables:  
            name     class levels     n missing
    1  occrecode character     11 58136    3561
    2    wrkstat character      8 61676      21
    3     gender character      2 61697       0
    4    educcat character      5 61562     135
    5 maritalcat character      5 61670      27
                                       distribution
    1 Professional (19%), Service (16.9%) ...      
    2 Full-Time (49.4%), Housekeeper (15.1%) ...   
    3 Female (56.1%), Male (43.9%)                 
    4 High School (51.5%) ...                      
    5 Married (51.7%), Never Married (21.8%) ...   
    
    quantitative variables:  
          name   class  min    Q1 median    Q3      max         mean           sd
    1 rownames numeric    1 15425  30849 46273  61697.0 30849.000000 17810.534116
    2     year numeric 1974  1985   1996  2006   2018.0  1996.073715    12.794470
    3 realrinc numeric  227  8156  16563 27171 480144.5 22326.359234 28581.794499
    4      age numeric   18    32     44    59     89.0    46.176177    17.561065
    5    occ10 numeric   10  2710   4720  6230   9997.0  4695.774081  2627.724076
    6 prestg10 numeric   16    33     42    50     80.0    43.060701    12.987526
    7   childs numeric    0     0      2     3      8.0     1.923457     1.763569
          n missing
    1 61697       0
    2 61697       0
    3 37887   23810
    4 61478     219
    5 58136    3561
    6 57511    4186
    7 61508     189

    Data Dictionary

    From the dataset documentation page, we note that this is a large dataset (61K rows), with 11 variables (the 12th column, rownames, is simply a row index that comes with the CSV file):

    Note: Quantitative Data
    • year(dbl): the survey year
    • realrinc(dbl): the respondent’s base income (in constant 1986 USD)
    • age(dbl): the respondent’s age in years
    • occ10(dbl): respondent’s occupation code (2010)
    • prestg10(dbl): respondent’s occupational prestige score (2010)
    • childs(dbl): number of children (0-8)
    Note: Qualitative Data
    • occrecode(chr): recode of the occupation code into one of 11 main categories
    • wrkstat(chr): the work status of the respondent (full-time, part-time, temporarily not working, unemployed (laid off), retired, school, housekeeper, other). 8 levels. 
    • gender(chr): respondent’s gender (male or female). 2 levels.
    • educcat(chr): respondent’s degree level (Less Than High School, High School, Junior College, Bachelor, or Graduate). 5 levels.
    • maritalcat(chr): respondent’s marital status (Married, Widowed, Divorced, Separated, Never Married). 5 levels.
    Note: Business Insights based on wages dataset
    • There is a fair amount of missing data; however, with 61K rows, we can for the present simply neglect the missing data.
    • Good mix of Qual and Quant variables

    Hypothesis and Research Questions

    • The target variable for an experiment that resulted in this data might be the realrinc variable, the resultant income of the individual, which is a numerical variable.
    Note: Research Questions:
    • What is the basic distribution of realrinc?
    • Is realrinc affected by gender?
    • By educcat? By maritalcat?
    • Is realrinc affected by childs?
    • Do combinations of these factors have an effect on the target variable?

    These should do for now! But we should make more questions when we have seen some plots!

    Data Munging

    Since there is so much missing data in the target variable realrinc and there is still enough data left over, let us remove the rows containing missing data in that variable.

    Important

    NOTE: This is not advised at all as a general procedure!! Data is valuable and there are better ways to manage this problem!

    wages_clean <-
      wages %>%
      tidyr::drop_na(realrinc) # choose column or leave blank to choose all
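
    To make the effect of the column argument concrete, here is a small sketch on a made-up data frame `toy` (the values are illustrative only). It also shows one gentler alternative, which is to keep the rows and skip the missing values only when computing a summary.

```r
library(tidyr)

# Made-up data with missing values in different columns
toy <- data.frame(
  realrinc = c(4935, NA, 18505, 22206),
  gender = c("Male", "Female", NA, "Male")
)

# Drop rows only where the target variable is missing
only_target <- drop_na(toy, realrinc)

# Drop rows with a missing value in ANY column
all_columns <- drop_na(toy)

nrow(only_target) # 3 rows survive
nrow(all_columns) # 2 rows survive

# Alternative: keep all rows, ignore NA only inside the calculation
median(toy$realrinc, na.rm = TRUE)
```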

    Plotting Box Plots

    Question-1: What is the basic distribution of realrinc?

    • Using ggformula
    • Using ggplot
    ggplot2::theme_set(new = theme_classic(base_family = "Roboto Condensed")) # Set consistent graph theme
    
    wages_clean %>%
      gf_boxplot(realrinc ~ "Income") %>% # Dummy X-axis "variable"
      gf_labs(
        title = "Plot 1A: Income has a skewed distribution",
        subtitle = "Many outliers on the high side"
      )

    ggplot2::theme_set(new = theme_classic(base_family = "Roboto Condensed")) # Set consistent graph theme
    wages_clean %>%
      ggplot() +
      geom_boxplot(aes(y = realrinc, x = "Income")) + # Dummy X-axis "variable"
      labs(
        title = "Plot 1A: Income has a skewed distribution",
        subtitle = "Many outliers on the high side"
      )

    Business Insights-1

    • Income is a very skewed distribution, as might be expected.
    • Presence of many higher-side outliers is noted.

    Question-2: Is realrinc affected by gender?

    • Using ggformula
    • Using ggplot
    ggplot2::theme_set(new = theme_classic(base_family = "Roboto Condensed")) # Set consistent graph theme
    
    wages_clean %>%
      gf_boxplot(gender ~ realrinc) %>%
      gf_labs(title = "Plot 2A: Income by Gender")

    Split by Gender

    ggplot2::theme_set(new = theme_classic(base_family = "Roboto Condensed")) # Set consistent graph theme
    wages_clean %>%
      gf_boxplot(gender ~ log10(realrinc)) %>%
      gf_labs(title = "Plot 2B: Log(Income) by Gender")

    With log income

    ggplot2::theme_set(new = theme_classic(base_family = "Roboto Condensed")) # Set consistent graph theme
    wages_clean %>%
      gf_boxplot(gender ~ realrinc, fill = ~gender) %>%
      gf_refine(scale_x_log10()) %>%
      gf_labs(title = "Plot 2C: Income filled by Gender, log scale")

    With log scale

    ggplot2::theme_set(new = theme_classic(base_family = "Roboto Condensed")) # Set consistent graph theme
    wages_clean %>%
      ggplot() +
      geom_boxplot(aes(y = gender, x = realrinc)) +
      labs(title = "Plot 2A: Income by Gender")
    ##
    wages_clean %>%
      ggplot() +
      geom_boxplot(aes(y = gender, x = log10(realrinc))) +
      labs(title = "Plot 2B: Log(Income) by Gender")
    ##
    wages_clean %>%
      ggplot() +
      geom_boxplot(aes(y = gender, x = realrinc, fill = gender)) +
      scale_x_log10() + # gf_refine() belongs to ggformula pipelines; ggplot layers are added with +
      labs(title = "Plot 2C: Income filled by Gender, log scale")
    (a) Split by Gender
    (b) With log income
    (c) With log scale
    Figure 5: Income by Gender

    Business Insights-2

    • Even when split by gender, realrinc presents a skewed set of distributions.
    • The IQR for males is smaller than the IQR for females. There is less variation in the middle ranges of realrinc for men.
    • log10 transformation helps to view and understand the regions of low realrinc.
    • There are outliers on both sides, indicating that there may be many people who make very small amounts of money and large amounts of money in both genders.
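
    Why does the log10 transformation help so much? A rough sketch, using the five summary values of realrinc copied from the skim() output above: on the raw scale the upper tail is many times wider than the box, while on the log10 scale the two are of similar order.

```r
# Summary landmarks of realrinc, copied from the skim() output above
income <- c(min = 227, q1 = 8156, median = 16563, q3 = 27171, max = 480144.5)

# Raw scale: width of the upper tail relative to the width of the box
raw_ratio <- unname((income["max"] - income["q3"]) / (income["q3"] - income["q1"]))

# log10 scale: the same comparison after transforming
log_income <- log10(income)
log_ratio <- unname((log_income["max"] - log_income["q3"]) /
  (log_income["q3"] - log_income["q1"]))

raw_ratio # the tail dwarfs the box
log_ratio # the tail is only a few times the box width
```

    Note that Plot 2B transforms the variable itself, so the axis shows log10 values, whereas Plot 2C keeps the variable and transforms the axis with scale_x_log10(), so the axis still reads in income units.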

    Question-3: Is realrinc affected by educcat?

    • Using ggformula
    • Using ggplot
    ggplot2::theme_set(new = theme_classic(base_family = "Roboto Condensed")) # Set consistent graph theme
    
    wages_clean %>%
      gf_boxplot(educcat ~ realrinc) %>%
      gf_labs(title = "Plot 3A: Income by Education Category")

    Split by Education Category

    ggplot2::theme_set(new = theme_classic(base_family = "Roboto Condensed")) # Set consistent graph theme
    
    wages_clean %>%
      gf_boxplot(educcat ~ log10(realrinc)) %>%
      gf_labs(title = "Plot 3B: Log(Income) by Education Category")

    With log income

    ggplot2::theme_set(new = theme_classic(base_family = "Roboto Condensed")) # Set consistent graph theme
    
    wages_clean %>%
      gf_boxplot(
        reorder(educcat, realrinc, FUN = median) ~ log(realrinc),
        fill = ~educcat,
        alpha = 0.3
      ) %>%
      gf_labs(title = "Plot 3C: Log(Income) by Education Category, sorted") %>%
      gf_labs(
        x = "Log Income",
        y = "Education Category"
      )

    With log income

    ggplot2::theme_set(new = theme_classic(base_family = "Roboto Condensed")) # Set consistent graph theme
    
    wages_clean %>%
      gf_boxplot(reorder(educcat, realrinc, FUN = median) ~ realrinc,
        fill = ~educcat,
        alpha = 0.5
      ) %>%
      gf_refine(scale_x_log10()) %>%
      gf_labs(
        title = "Plot 3D: Income by Education Category, sorted",
        subtitle = "Log Income"
      ) %>%
      gf_labs(
        x = "Income",
        y = "Education Category"
      )

    Log Income Scale

    ggplot2::theme_set(new = theme_classic(base_family = "Roboto Condensed")) # Set consistent graph theme
    
    wages_clean %>%
      ggplot() +
      geom_boxplot(aes(realrinc, educcat)) + # (x,y) format
      labs(title = "Plot 3A: Income by Education Category")
    ##
    wages_clean %>%
      ggplot() +
      geom_boxplot(aes(log10(realrinc), educcat)) +
      labs(title = "Plot 3B: Log(Income) by Education Category")
    ##
    wages_clean %>%
      ggplot() +
      geom_boxplot(
        aes(log(realrinc),
          reorder(educcat, realrinc, FUN = median),
          fill = educcat
        ),
        alpha = 0.3
      ) +
      labs(
        title = "Plot 3C: Log(Income) by Education Category, sorted",
        x = "Log Income", y = "Education Category"
      )
    ##
    wages_clean %>%
      ggplot() +
      geom_boxplot(
        aes(realrinc,
          reorder(educcat, realrinc, FUN = median),
          fill = educcat
        ),
        alpha = 0.3
      ) +
      scale_x_log10() +
      labs(
        title = "Plot 3D: Income by Education Category, sorted",
        subtitle = "Log Income Scale",
        x = "Income", y = "Education Category"
      )
    (a) Split by Education Category
    (b) With log income
    (c) With log scale
    (d) Split by Education Category
    Figure 6: Income by Education Category

    Business Insights-3

    • realrinc rises with educcat, which is to be expected.
    • However, there are people with very low and very high income in all categories of educcat.
    • Hence educcat alone may not be a good predictor for realrinc.
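
    Plots 3C and 3D sort the boxes using reorder(educcat, realrinc, FUN = median). As a minimal sketch of what that call does, here it is on made-up data (the category names and incomes are hypothetical): the levels of the Qual variable are re-arranged in increasing order of the group medians.

```r
# Made-up categories and incomes
educ <- c("Bachelor", "Bachelor", "School", "School", "Graduate", "Graduate")
income <- c(10, 20, 1, 3, 100, 200)

# Group medians: Bachelor = 15, School = 2, Graduate = 150
tapply(income, educ, median)

# reorder() returns a factor whose levels follow those medians
educ_sorted <- reorder(educ, income, FUN = median)
levels(educ_sorted)
```

    Without reorder(), the boxes would appear in alphabetical order of the category names, which makes a rising trend harder to spot.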

    We can do similar work with the other Qual variables. Let us now see how we can use more than one Qual variable and answer the last hypothesis, Question 4.

    Question-4: Is the target variable realrinc affected by combinations of Qual factors gender, educcat, maritalcat and childs?

    Important

    This is a rather complex question and could take us deep into Modelling. Ideally, if we were to do this manually, we ought to:

    • Take each Qual variable and explain its effect on the target variable
    • Remove that effect and model the remainder (i.e. the residual) with the next Qual variable
    • Proceed in this way until we have a good model.

    There are more modern Modelling Workflows, that can do things much faster and without such manual tweaking.

    So we will simply plot box plots showing effects on the target variable of combinations of Qual variables taken two at a time. (We will of course use faceted box plots!)

    We will also drop NA values all around this time, to avoid seeing boxplots for undocumented categories.

    Note: Question-4: Is realrinc affected by combinations of factors?
    • Using ggformula
    • Using ggplot
    ggplot2::theme_set(new = theme_classic(base_family = "Roboto Condensed")) # Set consistent graph theme
    
    wages %>%
      drop_na() %>%
      gf_boxplot(reorder(educcat, realrinc) ~ log10(realrinc),
        fill = ~educcat,
        alpha = 0.5
      ) %>%
      gf_facet_wrap(vars(childs)) %>%
      gf_refine(scale_fill_brewer(type = "qual", palette = "Dark2")) %>%
      gf_labs(
        title = "Plot 4A: Log Income by Education Category and Family Size",
        x = "Log income",
        y = "Education Category" # the facets, not the y-axis, show the number of children
      )
    Figure 7: Split by Education Category and Family Size
    ggplot2::theme_set(new = theme_classic(base_family = "Roboto Condensed")) # Set consistent graph theme
    
    wages %>%
      drop_na() %>%
      mutate(childs = as_factor(childs)) %>%
      gf_boxplot(childs ~ log10(realrinc),
        group = ~childs,
        fill = ~childs,
        alpha = 0.5
      ) %>%
      gf_facet_wrap(~gender) %>%
      gf_refine(scale_fill_brewer(type = "qual", palette = "Set3")) %>%
      gf_labs(
        title = "Plot 4B: Log Income by Gender and Family Size",
        x = "Log income",
        y = "No. of Children"
      )
    Figure 8: Split by Gender and Family Size
    ggplot2::theme_set(new = theme_classic(base_family = "Roboto Condensed")) # Set consistent graph theme
    
    wages %>%
      drop_na() %>%
      ggplot() +
      geom_boxplot(
        aes(log10(realrinc), reorder(educcat, realrinc),
          fill = educcat
        ), # aes() closes here
        alpha = 0.5
      ) +
      facet_wrap(vars(childs)) +
      scale_fill_brewer(type = "qual", palette = "Dark2") +
      labs(title = "Plot 4A: Log Income by Education Category and Family Size", x = "Log income", y = "Education Category")
    ##
    wages %>%
      drop_na() %>%
      mutate(childs = as_factor(childs)) %>%
      ggplot() +
      geom_boxplot(
        aes(log10(realrinc), childs,
          group = childs,
          fill = childs
        ), # aes() closes here
        alpha = 0.5
      ) +
      facet_wrap(vars(gender)) +
      scale_fill_brewer(type = "qual", palette = "Set3") +
      labs(
        title = "Plot 4B: Log Income by Gender and Family Size",
        x = "Log income",
        y = "No. of Children"
      )
    (a) Split by Education Category and Family Size
    (b) Split by Gender and Family Size
    Figure 9: Income and Other Qual Variables

    Business Insights-4

    • From Figure 7, we see that realrinc increases with educcat, across (almost) all family sizes childs.
    • However, this trend breaks a little when family size childs is large, say >= 7. Be aware that the data observations for such large families may be sparse and this inference may not necessarily be valid.
    • From Figure 8, we see that the effect of childs on realrinc is different for each gender! For females, the income steadily drops with the number of children, whereas for males it actually increases up to a certain family size before decreasing again.

    Are the Differences Significant?

    Important: Hunches and Hypotheses

    In data analysis, we always want to know, as in life, how important things are, whether they matter. To do this, we make up hunches, or more precisely, Hypotheses. We make two in fact:

    \(H_0\): Nothing is happening;
    \(H_a\): (“a” for Alternate): Something is happening and it is important enough to pay attention to.

    We then pretend that \(H_0\) is true and ask that our data prove us wrong; if it does, we reject \(H_0\) in favour of \(H_a\).

    This is a very important idea of Hypothesis Testing which helps you justify your hunch. We will study this when we do Stats Tests for differences between two means (t-tests), and those between more than two means (ANOVA).
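
    As a foretaste, the "pretend \(H_0\) is true" idea can be sketched with a simple shuffle (permutation) test. Everything here is an assumption for illustration: the two groups are made-up numbers, and the shuffle test stands in for the t-test and ANOVA that we will meet later.

```r
set.seed(2024)

# Two made-up groups of a Quant variable
group_a <- c(10, 12, 11, 14, 13, 15, 12, 11)
group_b <- c(16, 18, 17, 19, 20, 18, 17, 21)

values <- c(group_a, group_b)
labels <- rep(c("A", "B"), times = c(length(group_a), length(group_b)))

# What we actually observed: the difference in group means
observed_diff <- mean(values[labels == "B"]) - mean(values[labels == "A"])

# Pretend H0 is true: if nothing is happening, the labels are arbitrary,
# so shuffle them many times and see what differences arise by chance
null_diffs <- replicate(2000, {
  shuffled <- sample(labels)
  mean(values[shuffled == "B"]) - mean(values[shuffled == "A"])
})

# How often does chance alone produce a difference at least this large?
p_value <- mean(abs(null_diffs) >= abs(observed_diff))

observed_diff
p_value
```

    If shuffled labels almost never reproduce a difference as large as the observed one, the data are telling us that \(H_0\) is hard to believe.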

    Wait, But Why?

    • Box plots are a powerful statistical graphic that give us a combined view of data ranges, quartiles, medians, and outliers.
    • Box plots can compare groups within our Quant variable, based on levels of a Qual variable. This is a very common and important task in research!
    • In your design research, you would have numerical Quant data that is accompanied by categorical Qual data pertaining to groups within your target audience.
    • Analyzing for differences in the Quant across levels of the Qual (e.g., household expenditure across groups of people) is a vital step in justifying time, effort, and money for further actions in your project. Don’t faff this.
    • Box plots are ideal for visualizing statistical tests for difference in mean values across groups (t-test and ANOVA). (Even though they plot medians)

    Conclusion

    • Box Plots “dwell upon” medians and Quartiles
    • Box Plots can show distributions of a Quant variable over levels of a Qual variable
    • This allows a comparison of box plots side by side to visibly detect differences in medians and IQRs across such levels.
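
    The same side-by-side comparison can be made numerically. A minimal sketch with dplyr, on a made-up tibble `toy` with a Qual column `group` and a Quant column `value` (both names are hypothetical):

```r
library(dplyr)

# Made-up data: one Qual variable with two levels, one Quant variable
toy <- tibble(
  group = rep(c("A", "B"), each = 5),
  value = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50)
)

# Median and IQR of the Quant variable for each level of the Qual variable
group_summary <- toy %>%
  group_by(group) %>%
  summarise(
    med = median(value),
    iqr = IQR(value),
    n = n()
  )

group_summary
```

    Including the count n alongside the median and IQR is a cheap way to notice groups that are too sparse to support a conclusion.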

    Your Turn

    Here are a few datasets that you might want to analyze with box plots:

    Note: Insurance Data

    Note: Political Donations

    Note: UFO Encounters

    The data dictionary for this dataset is here at the TidyTuesday Website. The TidyTuesday Website is a treasure trove of interesting datasets!

    Note: GPT-based Language detectors are biased against non-native English writers.

    What story can you tell, and what deductions can you make, from Figure 10 below? How would you replicate it? What would you add?

    Figure 10: AI Detectors

    AI Generated Summary and Podcast

    This excerpt from “Groups – Applied Metaphors: Learning TRIZ, Complexity, Data/Stats/ML using Metaphors” provides a comprehensive guide to understanding and utilizing box plots for data visualization and analysis. The text explores the purpose, functionality, and application of box plots within the context of exploring relationships between quantitative and qualitative variables. The author illustrates these concepts using a case study of the “gss_wages” dataset, examining wage discrepancies by gender, occupation, age, and education. Through this analysis, the author highlights the effectiveness of box plots in visualizing distributions, identifying outliers, and comparing groups, providing valuable insights into the complexities of data. The text concludes with a call to action, encouraging readers to explore real-world datasets and apply these techniques to uncover hidden trends and patterns within data.

    • What are the relationships between qualitative and quantitative variables in the gss_wages dataset?

    • How do box plots help visualize and understand the distribution of income across different groups?

    • What insights can be gained by analyzing the impact of multiple qualitative factors on income distribution?


    References

    1. Chang, Winston (2024). R Graphics Cookbook. https://r-graphics.org
    2. Bevans, R. (2023, June 22). An Introduction to t Tests | Definitions, Formula and Examples. Scribbr. https://www.scribbr.com/statistics/t-test/
    3. Brown, Angus (2008). The Strange Origins of the t-test. Physiology News, No. 71, Summer 2008. https://static.physoc.org/app/uploads/2019/03/22194755/71-a.pdf
    4. Ziliak, Stephen T. (2008). Guinnessometrics: The Economic Foundation of “Student’s” t. Journal of Economic Perspectives, Volume 22, Number 4, Fall 2008, Pages 199–216. https://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.22.4.199
    5. https://quillette.com/2024/08/03/xy-athletes-in-womens-olympic-boxing-paris-2024-controversy-explained-khelif-yu-ting/
    6. Senefeld JW, Lambelet Coleman D, Johnson PW, Carter RE, Clayburn AJ, Joyner MJ. Divergence in Timing and Magnitude of Testosterone Levels Between Male and Female Youths. JAMA. 2020;324(1):99–101. doi:10.1001/jama.2020.5655. https://jamanetwork.com/journals/jama/fullarticle/2767852
    7. Coleman, Doriane Lambelet (2017). Sex in Sport. 80 Law and Contemporary Problems. Available at: https://scholarship.law.duke.edu/lcp/vol80/iss4/5
    8. Distributome - An Interactive Web-based Resource for Probability Distributions https://distributome.org

    R Package Citations

    Package        Version  Citation
    ggridges       0.5.6    Wilke (2024)
    NHANES         2.1.0    Pruim (2015)
    TeachHist      0.2.1    Lange (2023)
    TeachingDemos  2.13     Snow (2024)
    tidyplots      0.2.2    Engler (2024)
    tinyplot       0.4.1    McDermott, Arel-Bundock, and Zeileis (2025)
    tinytable      0.9.0    Arel-Bundock (2025)
    visStatistics  0.1.7    Schilling (2025)
    visualize      4.5.0    Balamuta (2023)

    Arel-Bundock, Vincent. 2025. tinytable: Simple and Configurable Tables in “HTML,” “LaTeX,” “Markdown,” “Word,” “PNG,” “PDF,” and “Typst” Formats. https://doi.org/10.32614/CRAN.package.tinytable.
    Balamuta, James. 2023. visualize: Graph Probability Distributions with User Supplied Parameters and Statistics. https://doi.org/10.32614/CRAN.package.visualize.
    Engler, Jan Broder. 2024. “Tidyplots Empowers Life Scientists with Easy Code-Based Data Visualization.” bioRxiv. https://doi.org/10.1101/2024.11.08.621836.
    Lange, Carsten. 2023. TeachHist: A Collection of Amended Histograms Designed for Teaching Statistics. https://doi.org/10.32614/CRAN.package.TeachHist.
    McDermott, Grant, Vincent Arel-Bundock, and Achim Zeileis. 2025. tinyplot: Lightweight Extension of the Base R Graphics System. https://doi.org/10.32614/CRAN.package.tinyplot.
    Pruim, Randall. 2015. NHANES: Data from the US National Health and Nutrition Examination Study. https://doi.org/10.32614/CRAN.package.NHANES.
    Schilling, Sabine. 2025. visStatistics: Automated Selection and Visualisation of Statistical Hypothesis Tests. https://doi.org/10.32614/CRAN.package.visStatistics.
    Snow, Greg. 2024. TeachingDemos: Demonstrations for Teaching and Learning. https://doi.org/10.32614/CRAN.package.TeachingDemos.
    Wilke, Claus O. 2024. ggridges: Ridgeline Plots in “ggplot2”. https://doi.org/10.32614/CRAN.package.ggridges.

    Footnotes

    1. The term “throwing a shade” can be found in Jane Austen’s novel Mansfield Park (1814). Young Edmund Bertram is displeased with a dinner guest’s disparagement of the uncle who took her in: “With such warm feelings and lively spirits it must be difficult to do justice to her affection for Mrs. Crawford, without throwing a shade on the Admiral.”

    2. “Ah, Misha, he has a stormy spirit. His mind is in bondage. He is haunted by a great, unsolved doubt. He is one of those who don’t want millions, but an answer to their questions.” (Fyodor Dostoevsky, The Brothers Karamazov: A Novel in Four Parts With Epilogue)

    Citation

    BibTeX citation:
    @online{v.2024,
      author = {V., Arvind},
      title = {Groups},
      date = {2024-06-24},
      url = {https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/24-BoxPlots/},
      langid = {en},
      abstract = {Quant and Qual Variable Graphs and their Siblings}
    }
    
    For attribution, please cite this work as:
    V., Arvind. 2024. “Groups.” June 24, 2024. https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/24-BoxPlots/.

    License: CC BY-SA 2.0

    Website made with ❤️ and Quarto, by Arvind V.
